Support vector data description with kernel density estimation (SVDD-KDE) control chart for network intrusion monitoring

Multivariate control charts have been applied in many sectors, including network intrusion detection. However, conventional control charts have difficulty monitoring network traffic data, which do not follow the normal distribution that these charts require. Consequently, more false alarms occur when network traffic data are inspected. To address this problem, support vector data description (SVDD) is suggested. A control chart based on the SVDD distance can be applied to non-normal and even unknown distributions. Kernel density estimation (KDE) is a nonparametric approach that can be used to estimate the control limit of nonparametric control charts. Based on these facts, a multivariate chart that integrates SVDD and KDE (SVDD-KDE) is proposed to monitor network anomalies. A simulation using synthetic datasets is performed to examine the performance of the SVDD-KDE chart in detecting multivariate data shifts and outliers. Based on the simulation results, the proposed method produces better performance in detecting shifts and higher accuracy in detecting outliers. Further, the proposed method is applied in an intrusion detection system (IDS) to monitor network attacks, with the NSL-KDD data used as the benchmark dataset. The SVDD-KDE chart is compared with other IDS-based control charts and machine learning algorithms. Although it has a high computational cost, the results show that the IDS based on the SVDD-KDE chart achieves a high accuracy of 0.917 and an AUC of 0.915 with a low false positive rate compared to several algorithms.

Networks, computers, and technology play a significant part in daily life. However, network attacks have undermined their benefits in recent years. The intrusion detection system (IDS) is a security component that inspects network connections and blocks suspicious packets [1]. Many studies on intrusion detection have used machine learning methods. Machine learning algorithms applied in IDS include naïve Bayes (NB) [2,3], logistic regression (LR) [4,5], decision tree (DT) [6], random forest (RF) [7-9], support vector machine (SVM) [3,4], support vector data description (SVDD) [10,11], convolutional neural network (CNN) [12,13], recurrent neural network (RNN) [14,15], and long short-term memory (LSTM) [16,17].
Intrusion detection can be conducted by scanning for anomalies or suspicious network traffic patterns [18]. These network anomalies are analogous to out-of-control samples or outliers in quality monitoring with a control chart. Hence, statistical process control (SPC) methods, especially multivariate charts, can be utilized in IDS [19]. An IDS based on a multivariate control chart that inspects network traffic anomalies can be a powerful tool for protecting the safety and reliability of the network [20].
Several studies have applied multivariate control charts in IDS. Abdel-Aziz et al. [21] used a multivariate chart for network anomaly monitoring. The combination of the $T^2$ chart with the successive difference covariance matrix (SDCM) for IDS shows acceptable results in finding network attacks [22]. An IDS based on the robust Hotelling's $T^2$ chart using an adaptive control limit with kernel density estimation (KDE) displayed a faster computational time without lowering accuracy and precision [23]. The PCA-based $T^2$ chart using the robust fast minimum covariance determinant (FMCD) estimator and a KDE control limit yields a lower false negative rate and higher accuracy than the other charts [24]. The PCA Mix and Kernel PCA (KPCA) Mix control charts perform better in detecting network anomalies than the other methods [25,26].
Although they have been widely used, IDS-based multivariate control charts have some issues. The majority of multivariate charts are developed under a specific distributional assumption, as stated by Ahsan et al. [27]. Zhu highlighted that network traffic rarely follows a multivariate normal distribution because of the extreme values caused by intrusions [28]. As a consequence, many false alarms occur.
Furthermore, most of the multivariate control charts used in IDS are Hotelling's $T^2$ charts. However, the $T^2$ statistic can be easily affected by outliers [29]. As a result, its ability to detect anomalies can decrease [30]. These conditions threaten the security and stability of the system because the system has a lower detection rate and produces more false alarms [23].
To overcome this situation, the support vector data description (SVDD) algorithm can be applied to increase the detection rate and to handle non-normality. SVDD is a one-class classification method developed from the SVM method to detect outliers. It was originally proposed in [31]. The SVDD-based control chart can be used when the distributions of the quality characteristics vary or are even unknown. Using kernel functions in SVDD creates boundaries that follow the pattern of the normal connection data without requiring a specific distribution. With this method, more anomalies or attacks can be identified.
Furthermore, KDE can be used to address the high false alarm issue. This method creates the control limit from the information in the normal connection data. Several IDS have employed this method [23,24,32]. Because KDE can estimate the empirical distribution of various types of data patterns, it decreases the false positive rate, or swamping effect, of the proposed IDS.
Based on the problems above, this research proposes a multivariate chart that combines SVDD and KDE (SVDD-KDE) to inspect anomalies in the network. First, the performance of the SVDD-KDE chart in detecting process shifts is assessed using the average run length (ARL) criterion. Further, the performance of the proposed SVDD-KDE chart in detecting outliers is examined. Finally, the proposed SVDD-KDE chart is employed to monitor the synthetic dataset and network traffic. The NSL-KDD dataset is used as the benchmark for the IDS. In addition, the performance of the proposed IDS based on the SVDD-KDE chart is compared with several control charts and machine learning algorithms.
The remainder of this paper is organized as follows. The procedures of the proposed SVDD-KDE chart are elaborated in Section "Proposed SVDD-KDE chart". The performance of the proposed SVDD-KDE chart is provided in Section "SVDD-KDE control chart performance". Section "IDS-based proposed control chart algorithm" discusses the proposed IDS algorithms. The use of the proposed IDS based on the SVDD-KDE chart in detecting network anomalies is discussed in Section "Application for monitoring network anomaly". Finally, Section "Conclusions" gives the conclusions and suggestions for future research.

Support vector data description (SVDD)
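Let $x_i = [x_{i1}, x_{i2}, \ldots, x_{ip}]'$, $i = 1, 2, \ldots, n$, be a column vector of dimension $p$ taken from the training data. To fit a sphere around the target data, the sphere is determined by solving the following quadratic programming problem:

$$\min_{R,\,a,\,\varsigma} F(R, a) = R^2 + \kappa \sum_{i=1}^{n} \varsigma_i \tag{1}$$

subject to

$$\|x_i - a\|^2 \le R^2 + \varsigma_i, \qquad \varsigma_i \ge 0, \qquad i = 1, 2, \ldots, n, \tag{2}$$

where $F$ is the cost function to be minimized, and $a$ and $R$ are the center and the radius of the sphere, respectively. The slack variable that allows outliers in the training data is denoted by $\varsigma_i$, and $\kappa > 0$ is a penalty parameter that controls the trade-off between the volume of the sphere and misclassification. Eq. (2) can be incorporated into Eq. (1) with Lagrange multipliers as follows:

$$L(R, a, \varsigma, \alpha^*, \gamma) = R^2 + \kappa \sum_{i=1}^{n} \varsigma_i - \sum_{i=1}^{n} \alpha_i^* \left\{ R^2 + \varsigma_i - \|x_i - a\|^2 \right\} - \sum_{i=1}^{n} \gamma_i \varsigma_i \tag{3}$$

where $\alpha_i^* \ge 0$ and $\gamma_i \ge 0$. The dual problem of Eq. (3) is written as:

$$\max_{\alpha^*} \; \sum_{i=1}^{n} \alpha_i^* (x_i \cdot x_i) - \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i^* \alpha_j^* (x_i \cdot x_j) \tag{4}$$

subject to

$$\sum_{i=1}^{n} \alpha_i^* = 1, \tag{5}$$

where $0 \le \alpha_i^* \le \kappa$. The distance between a support vector and the hypersphere center is called the hypersphere radius and is formulated as follows:

$$R^2 = (x_k \cdot x_k) - 2 \sum_{i=1}^{n} \alpha_i^* (x_i \cdot x_k) + \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i^* \alpha_j^* (x_i \cdot x_j) \tag{6}$$

where $x_k$ is a support vector. Furthermore, the distance from the test data $z$ to the center of the hypersphere needs to be calculated. The inner product $x_i \cdot x_j$ in Eqs. (4) and (6) can be replaced with a kernel function $K(x_i, x_j)$ to make the SVDD method more flexible for outlier detection. The distance is then defined as:

$$D^2(z) = K(z, z) - 2 \sum_{i=1}^{n} \alpha_i^* K(x_i, z) + \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i^* \alpha_j^* K(x_i, x_j) \tag{7}$$

In this research, the kernel function applied in SVDD is the radial basis function (RBF) kernel, expressed as:

$$K(x_i, x_j) = \exp\left( -\frac{\|x_i - x_j\|^2}{w^2} \right) \tag{8}$$

where $w$ is the hyperparameter of the RBF kernel. The distance $D^2$ is the statistic plotted on the proposed control chart, and its control limit is calculated using the KDE method.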
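As a rough illustration of how the dual weights and the kernel distance $D^2$ might be computed, the following sketch fits an SVDD with an RBF kernel using only the Python standard library. Several choices here are assumptions rather than the paper's implementation: the Frank-Wolfe solver stands in for a proper quadratic programming routine, the penalty $\kappa$ is taken to be at least 1 so that the upper bound on the weights is never active, the kernel width convention $\exp(-\|x - y\|^2 / w^2)$ is assumed, and the function names and toy data are hypothetical.

```python
import math
import random

def rbf_kernel(x, y, w):
    """RBF kernel exp(-||x - y||^2 / w^2); this width convention is an assumption."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (w * w))

def fit_svdd(X, w, n_iter=2000):
    """Approximate the SVDD dual weights with Frank-Wolfe iterations.

    Assumes kappa >= 1, so only alpha_i >= 0 and sum(alpha) = 1 constrain the problem.
    For an RBF kernel K(x, x) = 1, so maximizing the dual objective is the same as
    minimizing alpha' K alpha over the probability simplex.
    """
    n = len(X)
    K = [[rbf_kernel(X[i], X[j], w) for j in range(n)] for i in range(n)]
    alpha = [1.0 / n] * n
    K_alpha = [sum(K[i][j] * alpha[j] for j in range(n)) for i in range(n)]
    for t in range(n_iter):
        k = min(range(n), key=K_alpha.__getitem__)  # vertex with the smallest gradient
        step = 2.0 / (t + 2.0)
        alpha = [(1.0 - step) * a for a in alpha]
        alpha[k] += step
        K_alpha = [(1.0 - step) * K_alpha[i] + step * K[i][k] for i in range(n)]
    quad = sum(alpha[i] * K_alpha[i] for i in range(n))  # the double-sum term of D^2
    return alpha, quad

def svdd_distance(z, X, alpha, quad, w):
    """Kernel distance D^2(z) between a point z and the hypersphere center."""
    cross = sum(a * rbf_kernel(x, z, w) for a, x in zip(alpha, X) if a > 0.0)
    return rbf_kernel(z, z, w) - 2.0 * cross + quad

# Toy in-control data: 60 bivariate standard normal points (illustrative only).
rng = random.Random(0)
X_train = [[rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)] for _ in range(60)]
w = 2.0
alpha, quad = fit_svdd(X_train, w)
d_near = svdd_distance([0.0, 0.0], X_train, alpha, quad, w)
d_far = svdd_distance([6.0, 6.0], X_train, alpha, quad, w)
print("D^2 near the data:", d_near, "D^2 far from the data:", d_far)
```

A point far from the training cloud receives a larger $D^2$ than a point inside it, which is the property the chart relies on. A dedicated QP solver would be preferable for real data, particularly when $\kappa < 1$ and the box constraint matters.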

Kernel density estimation
KDE can be used to estimate the empirical probability density function (pdf) of a random variable with an unspecified distribution. Under the in-control state, the empirical distribution of $D^2$ can be estimated using KDE to compute its control limit. The kernel estimate of the distribution of the $D^2$ statistic is:

$$\hat{f}_{\rho}(D^2) = \frac{1}{n \rho} \sum_{i=1}^{n} K\left( \frac{D^2 - D_i^2}{\rho} \right) \tag{9}$$

where $\rho$ and $K$ denote the estimated smoothing parameter (bandwidth) and the kernel function, respectively. To calculate the KDE control limit, the Gaussian kernel is employed in this analysis. The control limit of the SVDD-KDE control chart is estimated from the $100(1 - \alpha)$-th percentile of the empirical distribution of $D^2$ and is determined using the following expression:

$$CL_{Kernel} = \hat{F}_{\rho}^{-1}(1 - \alpha), \qquad \hat{F}_{\rho}(t) = \int_{-\infty}^{t} \hat{f}_{\rho}(u)\, du \tag{10}$$

where $\alpha$ is the false alarm rate.
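To make the percentile-based control limit concrete, here is a minimal sketch that fits a Gaussian KDE to a set of in-control statistics and inverts its cumulative distribution by bisection. The bandwidth rule (normal reference rule of thumb) is an assumption, since the text does not state how $\rho$ is chosen, and the exponential toy sample merely stands in for real $D^2$ values; the function names are hypothetical.

```python
import math
import random
import statistics

def reference_bandwidth(sample):
    """Normal reference rule of thumb; an assumed choice for the bandwidth rho."""
    return 1.06 * statistics.stdev(sample) * len(sample) ** (-0.2)

def kde_cdf(t, sample, bandwidth):
    """CDF of a Gaussian KDE: the average of normal CDFs centered at the sample points."""
    scale = bandwidth * math.sqrt(2.0)
    return sum(0.5 * (1.0 + math.erf((t - s) / scale)) for s in sample) / len(sample)

def kde_control_limit(sample, alpha, bandwidth=None, tol=1e-8):
    """Return the 100(1 - alpha)-th percentile of the Gaussian KDE fitted to sample."""
    if bandwidth is None:
        bandwidth = reference_bandwidth(sample)
    lo = min(sample) - 10.0 * bandwidth  # KDE CDF is essentially 0 here
    hi = max(sample) + 10.0 * bandwidth  # KDE CDF is essentially 1 here
    target = 1.0 - alpha
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if kde_cdf(mid, sample, bandwidth) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Stand-in for in-control D^2 values (illustrative data only).
rng = random.Random(42)
d2_in_control = [rng.expovariate(1.0) for _ in range(500)]
false_alarm_rate = 0.0027
cl_kernel = kde_control_limit(d2_in_control, false_alarm_rate)
exceed_fraction = sum(d > cl_kernel for d in d2_in_control) / len(d2_in_control)
print("control limit:", cl_kernel, "fraction above limit:", exceed_fraction)
```

Because the limit comes from the estimated distribution rather than a parametric formula, the same routine applies whatever shape the in-control statistics take.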

SVDD-KDE control chart performance
This section presents the performance of the proposed SVDD-KDE chart. Three kinds of evaluation are conducted: performance in detecting process shifts, performance in detecting outliers, and performance in monitoring the synthetic dataset.

Performance for detecting process shift
This subsection evaluates the performance of the proposed SVDD-KDE control chart in identifying process shifts. A simulation study is conducted using the average run length (ARL) criterion. If the mean vector is written as $\mu$ and the covariance matrix as $\Sigma$, the data $X$ are generated from the multivariate normal distribution with $\mu = 0$ and $\Sigma = I$, that is, $X \sim N_p(0, I)$. When the process is in control (shift $\delta = 0$), $ARL_0$ is used to assess the performance of the SVDD-KDE chart. The target $ARL_0$ for this simulation study is 370, which corresponds to the 3-sigma rule. Furthermore, $ARL_1$ is calculated by adding a shift to the mean of every quality characteristic, $\mu_{shift} = \mu + \delta$, where, for example, $\delta = 0.1$ denotes the $p \times 1$ vector $[0.1, 0.1, \ldots, 0.1]'$. The performance of the SVDD-KDE chart is evaluated and compared with that of Hotelling's $T^2$ chart.
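The ARL criterion can be illustrated with a small Monte Carlo sketch. For simplicity it simulates the comparison chart rather than the SVDD-KDE chart: with known $\mu = 0$ and $\Sigma = I$, Hotelling's $T^2$ reduces to $x'x$, and for $p = 2$ the chi-square control limit has the closed form $-2 \ln \alpha$. The number of runs, the seed, the false alarm rate of 0.0027 (giving an in-control ARL of roughly 370), and the function names are illustrative assumptions.

```python
import math
import random

def run_length(delta, limit, rng, p=2):
    """Count samples until the monitoring statistic first exceeds the control limit."""
    t = 0
    while True:
        t += 1
        # With known mean 0 and covariance I, Hotelling's T^2 reduces to x'x.
        t2 = sum((rng.gauss(0.0, 1.0) + delta) ** 2 for _ in range(p))
        if t2 > limit:
            return t

def average_run_length(delta, alpha=0.0027, n_runs=1000, seed=1):
    """Monte Carlo ARL for a mean shift of delta in each of p = 2 variables."""
    rng = random.Random(seed)
    limit = -2.0 * math.log(alpha)  # exact chi-square quantile, valid only for p = 2
    total = sum(run_length(delta, limit, rng) for _ in range(n_runs))
    return total / n_runs

arl0 = average_run_length(0.0)  # in-control: expected to be near 1 / alpha
arl1 = average_run_length(1.0)  # shifted process: expected to be much shorter
print("ARL0:", arl0, "ARL1:", arl1)
```

The same loop applies to any chart once the statistic and its control limit are swapped in: a good chart keeps the in-control ARL near its target while driving the out-of-control ARL down.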
Table 1 presents the performance comparison of the proposed SVDD-KDE chart with Hotelling's $T^2$ chart for $p$ = 2, 3, 5, and 7. When there is no shift in the process, both charts produce a similar $ARL_0 \approx 370$ (bold value in the table). For the shifted process, the SVDD-KDE chart performs better than Hotelling's $T^2$ chart for both small and large shifts, as shown by its lower $ARL_1$ values. It can also be seen that the performance of the SVDD-KDE chart improves as the number of quality characteristics increases.
Table 2 presents the performance of the SVDD-KDE chart for several levels of correlation. For the in-control process, the proposed chart yields a stable $ARL_0$ of about 370. For the shifted process, the SVDD-KDE chart performs better at higher correlation in detecting small process shifts. On the other hand, at smaller correlation, the SVDD-KDE chart performs better in identifying larger process shifts.

Performance for detecting outlier
In this subsection, the performance of the proposed SVDD-KDE chart is appraised for different kinds of outliers. The percentages of outliers $\varepsilon$ mixed into the clean (normal) data are 5%, 10%, 15%, 20%, 30%, and 50% of the number of observations. As in the previous subsection, the clean data $X_{clean}$ are generated from the multivariate normal distribution with $\mu_{clean} = 0$ and $\Sigma = I$, that is, $X_{clean} \sim N_p(\mu_{clean}, I)$. The experiments are carried out for different numbers of quality characteristics, namely $p$ = 3, 5, 10, 15, 20, and 30. The contaminated data $X_{cont}$ are also generated from a multivariate normal distribution. Table 3 tabulates the confusion matrix for detecting outliers. The performance of the proposed SVDD-KDE chart in detecting outliers from the simulated data is assessed by 3 metrics: the Hit Rate, the false negative (FN) Rate, and the false positive (FP) Rate. The Hit Rate is defined as

$$\text{Hit Rate} = \frac{\text{True Positive (TP)} + \text{True Negative (TN)}}{n},$$

where $n$ is the number of observations.
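The three metrics can be computed from the confusion matrix counts as in the sketch below. The Hit Rate follows the definition in the text; the denominators used for the FP Rate and FN Rate (the number of truly normal and truly outlying observations, respectively) are the common convention and are an assumption here, as are the function name and the toy labels.

```python
def chart_metrics(is_outlier, flagged):
    """Return (hit_rate, fp_rate, fn_rate) from true labels and chart signals."""
    pairs = list(zip(is_outlier, flagged))
    tp = sum(1 for truth, signal in pairs if truth and signal)
    tn = sum(1 for truth, signal in pairs if not truth and not signal)
    fp = sum(1 for truth, signal in pairs if not truth and signal)
    fn = sum(1 for truth, signal in pairs if truth and not signal)
    n = len(pairs)
    hit_rate = (tp + tn) / n
    # Assumed convention: rates are taken over the true class sizes.
    fp_rate = fp / (fp + tn) if (fp + tn) else 0.0
    fn_rate = fn / (fn + tp) if (fn + tp) else 0.0
    return hit_rate, fp_rate, fn_rate

# Ten observations: the last two are true outliers (illustrative labels only).
truth = [False] * 8 + [True, True]
signals = [False] * 7 + [True, True, False]
hit_rate, fp_rate, fn_rate = chart_metrics(truth, signals)
print(hit_rate, fp_rate, fn_rate)
```

In this toy case one normal point is flagged and one outlier is missed, so 8 of the 10 observations are classified correctly.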
The simulation is repeated 1000 times to calculate the FN Rate, FP Rate, and Hit Rate. Tables 4, 5, 6, 7, 8 and 9 present the performance comparison of Hotelling's $T^2$ chart and the SVDD-KDE chart for detecting outliers. From the simulation results, it can be seen that for the number of outliers contaminated with the clean data lower than ...

Phase I: training phase
Step 1: Specify the hyperparameter $w$ of the RBF kernel and the false alarm rate $\alpha$.
Step 2: Create a matrix $X_{normal}$ that contains the normal connection data.
Step 3: Calculate the statistic $D^2$ from the normal labeled data $X_{normal}$ using Eq. (7).
Step 4: Estimate the KDE control limit $CL_{Kernel}$ using Eq. (10).

Phase II: testing and detection phase
The estimated hyperparameter values, the mean of the in-control $D^2$, and $CL_{Kernel}$ from training in Phase I are used in this phase. The procedures of the detection phase are defined as follows:
Step 1: Create a matrix $X_{test}$ that contains the new connection data.
Step 2: Calculate the statistic $D^2$ for the new connection data using the hyperparameter from Phase I.
Step 3: If $D_i^2 > CL_{Kernel}$, the connection is an intrusion, and if $D_i^2 < CL_{Kernel}$, the connection is normal, for $i = 1, 2, \ldots, n$.
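The two-phase flow can be sketched as a pair of small functions in which the monitoring statistic and the control limit estimator are passed in as arguments. To keep the example short and self-contained, it uses deliberately simple stand-ins that are not the paper's method: a squared distance to the data centroid in place of the SVDD $D^2$, and an empirical percentile in place of the KDE control limit. All function names and the toy data are hypothetical.

```python
import random

def phase_one(X_normal, make_statistic, estimate_limit, alpha):
    """Phase I: learn the statistic and its control limit from normal connections."""
    statistic = make_statistic(X_normal)          # e.g. an SVDD fit returning D^2(z)
    d2_normal = [statistic(x) for x in X_normal]  # in-control statistics
    return statistic, estimate_limit(d2_normal, alpha)

def phase_two(X_test, statistic, control_limit):
    """Phase II: flag a connection as an intrusion when its statistic exceeds the limit."""
    return [statistic(z) > control_limit for z in X_test]

def centroid_distance(X):
    """Stand-in statistic (NOT SVDD): squared Euclidean distance to the centroid."""
    center = [sum(column) / len(X) for column in zip(*X)]
    return lambda z: sum((a - c) ** 2 for a, c in zip(z, center))

def empirical_limit(values, alpha):
    """Stand-in limit (NOT KDE): empirical 100(1 - alpha)-th percentile."""
    ordered = sorted(values)
    index = min(len(ordered) - 1, int((1.0 - alpha) * len(ordered)))
    return ordered[index]

# Toy normal connections: 300 trivariate standard normal points (illustrative only).
rng = random.Random(7)
X_normal = [[rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(300)]
statistic, control_limit = phase_one(X_normal, centroid_distance, empirical_limit, 0.01)
X_test = [[0.0, 0.0, 0.0], [8.0, 8.0, 8.0]]
flags = phase_two(X_test, statistic, control_limit)
print(flags)
```

Keeping the statistic and the limit estimator as parameters means the same skeleton would work with an SVDD distance and a KDE-based limit, and Phase II then needs only the fitted statistic and a single number from Phase I.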

Application for monitoring network anomaly

NSL-KDD dataset
This subsection summarizes the dataset used in this research. The NSL-KDD data are used to show the performance of the proposed SVDD-KDE chart in monitoring network connection data. Table 11 gives the summary of the NSL-KDD dataset.

Performance of the proposed IDS
In this subsection, the hyperparameter is selected by finding the value that gives the best Hit Rate for the NSL-KDD dataset. From Table 12, it can be concluded that a lower value of $w$ produces more false alarms, as seen from the high FP Rate. On the other hand, a larger value of $w$ reduces the ability of the proposed chart to detect intrusions (higher FN Rate). From the results, it can be concluded that $w = 1$ yields the highest Hit Rate with balanced FP and FN Rates. This also confirms the results of the simulation studies in Section 3.3.

Comparison with the several IDS-based-control charts
This subsection compares the performance of the proposed IDS based on the SVDD-KDE chart with several charts, namely Hotelling's $T^2$ and SDCM-based Hotelling's $T^2$ with several control limits as in (Ahsan et al., 2018). SDCM-F uses the F distribution control limit, SDCM-CH uses the Chi-square control limit, SDCM-SW uses the Sullivan and Woodall control limit [33], and SDCM-MY uses the Mason and Young control limit [34]. The SVDD-KDE chart is also compared with the Robust Hotelling's $T^2$ chart with KDE and with the chart based on the Fast minimum covariance determinant (MCD) estimator (written as Fast MCD $T^2$).
Table 13 tabulates the comparison of the proposed IDS based on SVDD-KDE with several IDS-based control charts. The results show that the proposed chart yields a Hit Rate similar to those of SDCM-MY and Fast MCD $T^2$. Compared to SDCM-MY, the proposed IDS produces fewer false alarms. The proposed chart also yields results close to those of the IDS based on Fast MCD $T^2$. Hence, it is deduced that the proposed SVDD-KDE chart has a higher accuracy and AUC in detecting intrusions with a lower false alarm rate. The drawback of the proposed SVDD-KDE chart is its high computational time.

Comparison with the several machine learning algorithms
This subsection compares the performance of the proposed chart with other machine learning algorithms in monitoring the NSL-KDD dataset. The proposed IDS is compared with several machine learning algorithms such as the support vector machine, naïve Bayes, logistic regression, and decision tree. Based on the ... the objective, data preparation, control chart construction, identifying problems, and performing corrections for system improvement. Furthermore, the algorithms of the IDS based on the SVDD-KDE chart are split into 2 phases as follows:

Figure 1. Performance of the SVDD-KDE chart in monitoring simulated data in S1 for: (a) Small shift, (b) Moderate shift, and (c) Large shift.

Figure 2. Performance of the SVDD-KDE chart in monitoring simulated data in S2 for: (a) Small shift, (b) Moderate shift, and (c) Large shift.

Figure 3. Performance of the SVDD-KDE chart in monitoring simulated data in S3 for: (a) Small shift, (b) Moderate shift, and (c) Large shift.

Figure 4. Performance of the SVDD-KDE chart in monitoring simulated data in S4 for: (a) Small shift, (b) Moderate shift, and (c) Large shift.

Figure 5. Performance of the SVDD-KDE chart in monitoring simulated data in S5 for: (a) Small shift, (b) Moderate shift, and (c) Large shift.

Figure 6. Performance of the SVDD-KDE chart in monitoring simulated data in S6 for: (a) Small shift, (b) Moderate shift, and (c) Large shift.

Figure 7. Intrusion detection system using a control chart method [19].

Table 1. Performance for different quality characteristics. Significant values are in bold.

Table 2. Performance of the proposed chart for different correlation. Significant values are in bold.

Table 10. Scenarios of simulated data.

Table 12. Performance of proposed IDS for different hyperparameters. Significant values are in bold.

Table 13. Performance comparison with several control charts. Significant values are in bold.

Table 14. Performance comparison with several machine learning algorithms. Significant values are in bold.